[develop] Replace hpc-stack with spack-stack #913

Merged
merged 34 commits into from
Oct 6, 2023

Conversation

RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA commented Sep 14, 2023

DESCRIPTION OF CHANGES:

Replaced hpc-stack with spack-stack version 1.4.1.
In the modulefiles directory, the build* and wflow* scripts are updated. Also removed srw_common_spack.lua; all *.lua files now call the same srw_common.lua file.

Ran fundamental tests on Hera, Jet, Gaea and Orion

NOTE 1: MET and METplus needed to be fixed for this PR (now fixed).
NOTE 2: By using load_any, we still support machines with hpc-stack through a single srw_common.lua. Once hpc-stack is removed, we can drop load_any and have a simpler srw_common.lua modulefile.
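
The fallback described in the second note above can be sketched in plain Python. This only illustrates the selection logic (Lmod's load_any tries each listed module in order and loads the first one it finds); it is not Lmod itself. The function name pick_first_available, the module names and versions, and the two inventories are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Set


def pick_first_available(candidates: Iterable[str], available: Set[str]) -> Optional[str]:
    """Return the first candidate present in `available`, or None if none match."""
    for name in candidates:
        if name in available:
            return name
    return None


# Hypothetical module inventories for two kinds of machine (illustration only).
spack_stack_machine = {"netcdf-c/4.9.2", "netcdf-fortran/4.6.0"}
hpc_stack_machine = {"netcdf/4.7.4"}

# One shared candidate list serves both machines, which is why a single
# srw_common.lua can cover spack-stack and hpc-stack platforms.
candidates = ["netcdf-c/4.9.2", "netcdf/4.7.4"]

print(pick_first_available(candidates, spack_stack_machine))
print(pick_first_available(candidates, hpc_stack_machine))
```

Once every machine provides the same stack, the candidate list shrinks to one entry per library and the fallback is no longer needed.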

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

Once all machines switch to spack-stack, documentation on hpc-stack may be removed.

ISSUE:

Solves issue #912, #905

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS:

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS:

@natalie-perlin
@EdwardSnyder-NOAA

@RatkoVasic-NOAA
Collaborator Author

RatkoVasic-NOAA commented Sep 14, 2023

@EdwardSnyder-NOAA is working on fixing the issue with METPLUS_PATH (issue #905).

@MichaelLueken MichaelLueken changed the title Replace hpc-stack with spack-stack [develop] Replace hpc-stack with spack-stack Sep 14, 2023
@MichaelLueken MichaelLueken linked an issue Sep 14, 2023 that may be closed by this pull request
RatkoVasic-NOAA and others added 3 commits September 19, 2023 09:06
Conflicts:
	modulefiles/build_gaea_intel.lua
	modulefiles/build_hera_gnu.lua
	modulefiles/build_hera_intel.lua
	modulefiles/build_jet_intel.lua
	modulefiles/build_orion_intel.lua
	modulefiles/srw_common.lua
	modulefiles/srw_common_spack.lua
@MichaelLueken
Collaborator

@RatkoVasic-NOAA - Thanks for the heads up! Are you going to include a spack-stack build for Derecho as well, or just Hercules and Gaea C5? Just wanted to check and see which platforms would be included with this initial spack-stack transition.

@RatkoVasic-NOAA
Collaborator Author

RatkoVasic-NOAA commented Oct 2, 2023 via email

@EdwardSnyder-NOAA
Collaborator

I was able to successfully run vx cases on the NOAA cloud platforms (see below for details) after making a change to the ufs-srweather-app/modulefiles/tasks/noaacloud/python_srw.lua file. The change was pushed to the PR.

  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP on GCP
  • grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 on Azure
  • grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 on AWS

Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA left a comment


Ran a few met vx and/or fundamental tests without issue on Gaea, Hera, Hercules, Jet, NOAA Cloud, and Derecho. Approving.

@natalie-perlin natalie-perlin self-requested a review October 5, 2023 14:01
@natalie-perlin
Collaborator

natalie-perlin commented Oct 5, 2023

Suggesting some cleanup for Gaea and a longer walltime for the run_MET_PcpCombine_fcst_APCP* tasks, which timed out in one of my tests.

  1. ./modulefiles/wflow_gaea.lua: remove the following lines:
    local MKLROOT="/opt/intel/oneapi/mkl/2023.1.0/"
    prepend_path("LD_LIBRARY_PATH",pathJoin(MKLROOT,"lib/intel64"))
    pushenv("MKLROOT", MKLROOT)
    pushenv("GSI_BINARY_SOURCE_DIR", "/lustre/f2/dev/role.epic/contrib/GSI_data/fix/20230601")
    setenv("PMI_NO_PREINITIALIZE","1")

For spack-stack, intel-classic/2022.02 was used. Removing the last line, which sets "PMI_NO_PREINITIALIZE", also makes it possible to drop --mpi=pmi2 from RUN_CMD_UTILS in the gaea.yaml machine file.

  2. ./ush/machine/gaea.yaml: set the srun command:
    RUN_CMD_UTILS: srun --export=ALL -n $nprocs

  3. ./parm/wflow/verify_pre.yaml: increase the walltime for the run_MET_PcpCombine_fcst_APCP* tasks from 5 min to 10 min (last line in the code snippet below):

    task_run_MET_PcpCombine_fcst_APCP#ACCUM_HH#h_mem#mem#:
      <<: *default_task_verify_pre
      attrs:
        cycledefs: forecast
        maxtries: '2'
      command: '&LOAD_MODULES_RUN_TASK_FP; "run_vx" "&JOBSdir;/JREGIONAL_RUN_MET_PCPCOMBINE"'
      envars:
        <<: *default_vars
        VAR: APCP
        ACCUM_HH: '#ACCUM_HH#'
        obs_or_fcst: fcst
        OBTYPE: CCPA
        OBS_DIR: '&CCPA_OBS_DIR;'
        MET_TOOL: 'PCPCOMBINE'
        ENSMEM_INDX: "#mem#"
      dependency:
        datadep:
          attrs:
            age: 00:00:00:30
          text: !cycstr '{{ workflow.EXPTDIR }}/@y@m@d@H/post_files_exist_mem#mem#.txt'
      walltime: 00:10:00

@natalie-perlin
Collaborator

Timing of the run_MET_PcpCombine_fcst_APCP* tasks in grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test before the time increase:

CYCLE         TASK                                    JOBID      STATE      EXIT STATUS  TRIES  DURATION
========================================================================================================
... skipped...
202105121200  run_MET_Pb2nc_obs                       269362276  SUCCEEDED            0      1     265.0
202105121200  run_MET_PcpCombine_obs_APCP01h          269362277  SUCCEEDED            0      1     232.0
202105121200  run_MET_PcpCombine_obs_APCP03h          269362278  SUCCEEDED            0      1     230.0
202105121200  run_MET_PcpCombine_obs_APCP06h          269362279  SUCCEEDED            0      1     229.0
202105121200  check_post_output_mem001                269362427  SUCCEEDED            0      1     255.0
202105121200  check_post_output_mem002                269362428  SUCCEEDED            0      1     254.0
202105121200  run_MET_PcpCombine_fcst_APCP01h_mem001  269362520  DEAD                15      2     305.0
202105121200  run_MET_PcpCombine_fcst_APCP01h_mem002  269362521  DEAD                15      2     302.0
202105121200  run_MET_PcpCombine_fcst_APCP03h_mem001  269362522  DEAD                15      2     302.0
202105121200  run_MET_PcpCombine_fcst_APCP03h_mem002  269362523  DEAD                15      2     302.0
202105121200  run_MET_PcpCombine_fcst_APCP06h_mem001  269362524  DEAD                15      2     302.0
202105121200  run_MET_PcpCombine_fcst_APCP06h_mem002  269362525  DEAD                15      2     302.0

Rerunning the tasks after increasing the walltime to 10 min:

202105121200  run_MET_PcpCombine_fcst_APCP01h_mem001  269362661  SUCCEEDED            0      1     389.0
202105121200  run_MET_PcpCombine_fcst_APCP01h_mem002  269362664  SUCCEEDED            0      1     389.0
202105121200  run_MET_PcpCombine_fcst_APCP03h_mem001  269362666  SUCCEEDED            0      1     385.0
202105121200  run_MET_PcpCombine_fcst_APCP03h_mem002  269362667  SUCCEEDED            0      1     382.0
202105121200  run_MET_PcpCombine_fcst_APCP06h_mem001  269362668  SUCCEEDED            0      1     382.0
202105121200  run_MET_PcpCombine_fcst_APCP06h_mem002  269362669  SUCCEEDED            0      1     382.0
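
To make the timing argument concrete, the sketch below checks the rerun durations against both walltime limits. It assumes the DURATION column is reported in seconds; the helper names walltime_to_seconds and exceeds_walltime are hypothetical, and the durations are copied from the rerun listing above.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def exceeds_walltime(durations: dict, walltime: str) -> dict:
    """Return the tasks whose duration (assumed seconds) is above the walltime."""
    limit = walltime_to_seconds(walltime)
    return {task: secs for task, secs in durations.items() if secs > limit}


# Durations of the successful reruns, copied from the listing above.
rerun_durations = {
    "run_MET_PcpCombine_fcst_APCP01h_mem001": 389.0,
    "run_MET_PcpCombine_fcst_APCP01h_mem002": 389.0,
    "run_MET_PcpCombine_fcst_APCP03h_mem001": 385.0,
    "run_MET_PcpCombine_fcst_APCP03h_mem002": 382.0,
    "run_MET_PcpCombine_fcst_APCP06h_mem001": 382.0,
    "run_MET_PcpCombine_fcst_APCP06h_mem002": 382.0,
}

# Every rerun needed more than the old 5-minute limit...
print(sorted(exceeds_walltime(rerun_durations, "00:05:00")))
# ...and none comes close to the new 10-minute limit.
print(exceeds_walltime(rerun_durations, "00:10:00"))
```

Under that assumption, all six tasks need well over 300 seconds, which is consistent with the DEAD entries at roughly 302 to 305 seconds in the first listing.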

 

@RatkoVasic-NOAA
Collaborator Author

  1. ./modulefile/wflow_gaea.lua , remove the following lines:
    local MKLROOT="/opt/intel/oneapi/mkl/2023.1.0/"
    prepend_path("LD_LIBRARY_PATH",pathJoin(MKLROOT,"lib/intel64"))
    pushenv("MKLROOT", MKLROOT)
    pushenv("GSI_BINARY_SOURCE_DIR", "/lustre/f2/dev/role.epic/contrib/GSI_data/fix/20230601")
    setenv("PMI_NO_PREINITIALIZE","1")
  2. ./ush/machine/gaea.yaml, set srun command:
    RUN_CMD_UTILS: srun --export=ALL -n $nprocs
  3. ./parm/wflow/verify_pre.yaml- increase walltime for run_MET_PcpCombine_fcst_APCP* tasks, from 5 min to 10min (last line in the code snippet below):

@natalie-perlin Done.

Collaborator

@MichaelLueken MichaelLueken left a comment


@RatkoVasic-NOAA - These changes look good to me! I was also able to successfully run the fundamental tests on Derecho without issue:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              16.34
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.33
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              13.00
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              26.16
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              29.88
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              30.04
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.25
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             180.00

Approving now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Oct 5, 2023
@MichaelLueken
Collaborator

@RatkoVasic-NOAA - The Hera GNU build has failed in Jenkins. The message is:

The following modules are unknown:
yafyaml/v0.5.1, zlib/1.2.11, openblas/0.3.23, netcdf/4.9.2, nccmp/1.9.1.0, gftl-shared/v1.5.0, and jasper/2.0.25.

If you would like to see the Jenkins pipeline for this PR, please see:
https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-913/1/pipeline/243

Looking at the versions, openblas in build_hera_gnu.lua should be set to 0.3.19 and nccmp should be set to 1.9.0.1. However, the rest of the versions are present in the load_any sections of the srw_common.lua modulefile, so I'm not sure why the remaining modules are reported as unknown.

The workspace on Hera that contains the failed Jenkins test directory is:
/scratch2/NAGAPE/epic/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-913
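
For triage, the module list in the Jenkins message above can be split into name/version pairs and compared with the two versions suggested in this comment. This is a minimal sketch: parse_unknown_modules is a hypothetical helper, and the message text and suggested versions are copied from the comment.

```python
import re


def parse_unknown_modules(message: str) -> dict:
    """Extract {name: version} pairs from a 'modules are unknown' message."""
    # Match tokens such as "openblas/0.3.23" or "gftl-shared/v1.5.0".
    pairs = re.findall(r"([A-Za-z0-9_+-]+)/([A-Za-z0-9_.+-]+)", message)
    # The last token ends the sentence, so strip a trailing period.
    return {name: version.rstrip(".") for name, version in pairs}


message = (
    "The following modules are unknown:\n"
    "yafyaml/v0.5.1, zlib/1.2.11, openblas/0.3.23, netcdf/4.9.2, "
    "nccmp/1.9.1.0, gftl-shared/v1.5.0, and jasper/2.0.25."
)
unknown = parse_unknown_modules(message)

# Versions suggested in the comment for build_hera_gnu.lua.
suggested = {"openblas": "0.3.19", "nccmp": "1.9.0.1"}
for name, version in suggested.items():
    print(f"{name}: requested {unknown[name]}, suggested {version}")
```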

@RatkoVasic-NOAA
Collaborator Author

@MichaelLueken fixed and committed.

@MichaelLueken
Collaborator

Thanks, @RatkoVasic-NOAA! A test build using the Jenkins build scripts shows that the updated version should now build. I requeued the Hera GNU test in Jenkins.

@MichaelLueken
Collaborator

One test failed on Hera Intel:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    DEAD                   7.28
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.45
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             768.30
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.44
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               6.41
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              13.50
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              10.37
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               7.21
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             232.81
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             306.96
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             328.09
pregen_grid_orog_sfc_climo                                         COMPLETE               8.05
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                1709.87

Manual rerun of the custom_ESGgrid_Central_Asia_3km test passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              26.56
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              26.56

@natalie-perlin
Collaborator

Tested and successfully ran fundamental tests on Gaea, Gaea C5 (still with hpc-stack), Orion, and Hercules.

@natalie-perlin
Collaborator

Derecho and Gaea-C5 are not yet ready for the spack-stack transition and remain on hpc-stack for now.

@MichaelLueken
Collaborator

The Jenkins tests have successfully passed on Gaea, Gaea-C5, Hercules, Jet, and Orion. I have requeued the Hera GNU tests, which failed in the Functional Workflow Task Tests stage. Once the tests complete, I will merge this PR.

@MichaelLueken
Collaborator

The WE2E coverage tests were successfully run on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.75
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              35.80
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.60
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              27.08
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              17.12
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.60
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.50
pregen_grid_orog_sfc_climo                                         COMPLETE              14.05
specify_template_filenames                                         COMPLETE              14.52
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             234.02

@MichaelLueken
Collaborator

The Hera GNU tests have successfully passed with this morning's requeue:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km                                           COMPLETE              26.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200         COMPLETE              12.36
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE              12.64
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              45.59
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              28.31
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              21.41
long_fcst                                                          COMPLETE              72.55
MET_verification_only_vx                                           COMPLETE               0.25
MET_ensemble_verification_only_vx_time_lag                         COMPLETE               7.97
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              62.33
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             289.86

With this, all tests have successfully passed. Moving forward with merging this PR now.

Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Replace HPC-stack with SPACK-stack
Use MET/METplus modules on NOAA Cloud
4 participants